1 Reference proteomes
1.1 SGD
Saccharomyces Genome Database
The reference genome sequences of the widely used Baker's yeast are maintained by tbe SGD, corresponding to the strain S. cerevisiae S288C.
The SGD also gather many key information such as:
- genome assembly
- Protein-coding sequence
- gene/protein nomenclature
- biological activity (interaction/regulation/complexes)
- curated annotations and descriptins (gene ontology)
- gene and protein expression
- accurate scientific literature,
- genotype/phenotype experiments ...
# YEAST REFERENCE SEQUENCES
## SGD
S288C = load.sgd.proteome(withORF=T) # Reference SGD protein sequences
S288C.cds = load.sgd.CDS(withORF=T) # Reference coding sequences (CDS)
SGD = load.sgd.features() # main SGD gene/protein features1.2 UniProt
Universal Protein resource
UniProt provides the most comprehensive resource of protein sequence and functional information.
Most protein features are typically stored on the database called UniProt KnowledgeBase (UniProtKB).
This database is a major hub for many specialized databases and webservers that helps understand proteins role.
Their reference proteome gather proteins for which expression or proof of physical existence has been shown.
## UNIPROT
# saveRDS(load.uniprot.features(),"data/uniprot-features.rds") # Takes >250sec for ~6700 ids
UNI.SC = load.uniprot.proteome('yeast') # Reference UniprotKB sequences (Uniref)
UNI = readRDS("data/uniprot-features.rds") %>% # main UNIPROT gene/protein features
group_by(UNIPROTKB) %>%
mutate( is_uniref = UNIPROTKB %in% names(UNI.SC), one2one = (n()==1) ) 1.3 YK11
1,011 Saccharomyces cerevisiae isolates
The 1002 Yeast Genomes project constitutes the most comprehensive genomic dataset on a single species of yeast. Ultimately provide the most extensive view of the genetic and phenotypic diversity within this model species to date.
The project was originally slated to include 1002 strains from diverse global locations, as well as a variety of ecological sources. A total of 1011 whole genome sequences have been produced and published in J. Peter et al., Nature 2018.
## 1011 strains
# saveRDS(load.1011.strains(),"data/proteome-1011-strains.rds") # Takes >5mn for 6575 ids
YK11 = readRDS("data/proteome-1011-strains.rds") # 1011 strains proteomes sequences
strains.info = load.jackson2018.data() # strains info from supp mat of Science paper
## REF: J. Peter et al., 2018, Science
## Genome evolution across 1,011 Saccharomyces cerevisiae isolates
##
-
/
All strains read were aligned on the reference yeast genome (ATCC 204508 / S288c). Therefore, all sequenced ORF should have exactly the same length as the references.
1.4 Statistics
Below we calculate some general statistics on the reference yeast proteome.
id_orf=names(S288C)
map_sgd2orf = SGD %>% dplyr::filter(name %in% id_orf) %>% dplyr::select(sgdid,name)
id_sgd = map_sgd2orf$sgdid
UNI.1 = UNI # Backup of Uniprot features table
UNI = left_join(UNI,map_sgd2orf, by=c('SGD'='sgdid')) %>% dplyr::rename(ORF=name)
id_uni = names(UNI.SC)
map_uni2sgd = UNI %>% dplyr::filter(UNIPROTKB %in% id_uni & one2one & !is.na(SGD)) %>% dplyr::select(UNIPROTKB,SGD,ORF)
id_y1k = names(YK11)
n_sgd = length(S288C) %>% as.character()
n_uni = length(UNI.SC) %>% as.character()
n_y1k = length(YK11) %>% as.character()Sequences of both SGD and UNIPROTKB proteomes are quite similar, yet, they don't share the same number of proteins.
SGD contains 6713 sequences while UNIPROTKB has 6049 sequences.
uniseq = UNI.SC[map_uni2sgd$UNIPROTKB]
sgdseq = S288C[map_uni2sgd$ORF]
aliref = align.pair.prot(uniseq, sgdseq) # Takes ~20s for 6042 ids
#saveRDS(pairwise.alignment.to.df(aliref),"data/sgduni-aligned-proteome.rds") # Takes ~8mn for 6042 ids
ALI_REF = readRDS("data/sgduni-aligned-proteome.rds")
ALI_STAT = group_by(ALI_REF,s1) %>%
summarise( not_aligned=sum(!aligned), OL = mean(ol.12), PID = mean(pid.long) )
## OVERALL PROTEOME ALIGNMENT
# 99.93% (+/- 1.44%) identity on average for 6014 proteins
n_aligned = sum(ALI_STAT$PID==100) %>% as.character()
pid_avg = mean(ALI_STAT$PID) %>% round(2) %>% as.character()
pid_sd = sd(ALI_STAT$PID) %>% round(2) %>% as.character()
## FOR MISALIGNED PROTEINS
n_misaligned = sum(ALI_STAT$PID<100) %>% as.character()
# 85.97% (+/- 16.10%) identity on average for 28 proteins
pid_avg.2 = mean(ALI_STAT$PID[ALI_STAT$not_aligned!=0]) %>% round(2) %>% as.character()
pid_sd.2 = sd(ALI_STAT$PID[ALI_STAT$not_aligned!=0]) %>% round(2) %>% as.character()
n_unaligned = sum(ALI_STAT$not_aligned) %>% as.character()On average, the mapped SGD/UNIPROT sequences share 99.93% identity (+/- 1.44%)
6014 proteins are exactly identical (i.e. 100% identity) between the two proteomes.
28 proteins contains mismatches/indels for at least 2811 residues.
The misaligned proteins share on average 85.97% identity (+/- 16.1%)
2 Proteome datasets
2.1 Annnotations
Gene ontology, SGD description, Uniprot commments
The protein localizations can be obtained from curated annotations from Gene Ontology (GO) and the subcellular locations from Uniprot.
GO = get.uniprot.go(id_uni)
##
##
## 'select()' returned 1:many mapping between keys and columns
## Joining, by = c("UNIPROT", "GO")
CC = GO %>% dplyr::filter(ONTOLOGY == CC)
LOC = get.uniprot.localization(annot = UNI,loc_to_columns = T)
DESC = get.uniprot.sgd(id_uni)
## 'select()' returned 1:many mapping between keys and columns
DESC.UNI = DESC %>%
left_join(UNI, by=c('UNIPROT'='UNIPROTKB','ORF'='ORF','SGD'='SGD')) %>%
dplyr::filter(one2one)2.2 Conservation
Site-specific evolutionary rates
Rate4Site is a program for detecting conserved amino-acid sites by computing the relative evolutionary rate for each site in the multiple sequence alignment (MSA).
For the fungi lineage, we calculated the evolutionary rates of the orthologs defined in Wapinsky et al., Nature 2007.
For the S. cerevisiae population, we calculated the rate of evolution at every site of the 1011 isolates proteomes.This rate is expected to be better than a non-synonymous SNP frequency.
Indeed, the phylogenetic tree of the 1011 strains is taken into account to balance for SNPs that occur in closely related strains relative to SNPs coming from the most recent common ancestor (MRCA).
WAP.r4s = load.wapinsky2007.data() # Evolutionary rate for fungi lineage (~20s to load)
## Get rate4site data for yeast based on wapinsky 2007 fungi lineage: 18.58 sec elapsed
#saveRDS(load.rate4site_1011.data(), "data/rate4site-1011-strains.rds") (~3mn to load)
K11.r4s = readRDS("data/rate4site-1011-strains.rds") # Evolutionary rate for 1011 strains
EVO = load.aligned.data(data.path="data/" ) # Protein dataset with aligned evolutionary rate
##
## Merged dataset with residue-level informations from:
## (1) SGD S288C proteome sequence
## (2) Uniprot reference yeast proteome (aligned to SGD)
## (3) Dubreuil et al. (2019) for Abundance, Disorder (IUP+D2P2) & stickiness (+aaindex)
## (4) Rate4Site based on fungi lineage Wapinsky et al. (2007) of 14 yeast species
## (5) 3Dcomplex for yeast quaternary structures (based on X-Ray from PDB)
FUNGI = WAP.r4s %>%
group_by(r4s_orf,r4s_size) %>%
mutate(R4S_norm = R4S_res + abs(min_(R4S_res))) %>%
summarise(r4s.fungi = mean_(R4S_norm),
max_r4s.fungi = max_(R4S_norm)
)
POP = K11.r4s %>% dplyr::filter( !is.na(ID) ) %>%
group_by(orf,len.s288c) %>%
summarise(r4s.yk11 = mean_(SCORE),
max_r4s.yk11 = max_(SCORE)
)
R4S = left_join(POP,FUNGI, by=c('orf'='r4s_orf', 'len.s288c'='r4s_size'))
R4S.desc = left_join(R4S,DESC.UNI,by=c('orf'='ORF'))
F1 = ggplot(R4S.desc) +
geom_point(aes(y=r4s.fungi,x=r4s.yk11,text=orf,func=FUNCTION), color='white', fill=NA,,shape=21,size=0.8) +
xlab("1011 strains evolutionary rate") + ylab('Fungi Evolutionary rate') +
scale_x_log10() +
theme_black()
## Warning: Ignoring unknown aesthetics: text, func
ggplotly(F1)2.3 Expression
Protein expression is taken from Ho et al., Cell Systems, 2018 which attempts to unify protein abundance datasets from the Protein Abundance database.
tRNA adapation index is calculated from the coding sequence (CDS) from SGD as in Dos Reis et al., NAR, 2004.
TAI = load.codon.usage(S288C.cds)
## REF: M. Dos Reis et al., 2004, Nucleic Acids Research
## Solving the riddle of codon usage preferences: a test for translational selection
## Loading required package: tAI
PAB = load.ho2018.data()
## REF: B. Ho et al., 2018, Cell Systems
## Unification of Protein Abundance Datasets Yields a Quantitative Saccharomyces cerevisiae Proteome
EXP = left_join(TAI,PAB, by=c('prot'='orf'))
#F2 = ggplot(EXP)3 Yeast SNP
3.1 SNP from 1011 strains
Using the translated protein sequences from the 1011 isolates proteomes, I have calculated the frequency of amino acids at every site of each ORF across strains.
SNP_FREQ=0.05
ref.file = 'data/YK11_REF.rds'
snp.file = 'data/YK11_SNP.rds'
ref.snp.file = 'data/YK11_REF_SNP.rds'
P = tibble( orf=names(YK11),
n_strains=lengths(YK11),
len = sapply(widths(YK11),unique)
)
if( !file.exists(ref.file) ){
task1="Extract WT amino-acid..."
cat(task1)
tic(task1)
REF = map_df(YK11, get.REF, p=SNP_FREQ)
saveRDS(REF,ref.file)
toc()
}else{
REF = readRDS(ref.file)
}
if( !file.exists(snp.file) ){
task2="Extract ALT amino-acid..."
cat(task2)
tic(task2)
SNP = map_df(YK11, get.SNP, p=SNP_FREQ)
saveRDS(SNP,snp.file)
toc()
}else{
SNP = readRDS(snp.file)
}
REF_SNP = left_join(REF,SNP,by=c('pos_ref'='pos_alt', 'orf'='orf'))
#saveRDS(REF_SNP,ref.snp.file)
#REF_SNP = readRDS('data/YK11_REF_SNP.rds')
TOTAL_SNP_VAR = nrow(REF_SNP) %>% as.character()
TOTAL_SNP_POS = nrow(REF) %>% as.character()
PROT_SNP = left_join(P,REF_SNP,by=c('orf'='orf'))
snppos = map_df( YK11, count.SNP.bypos, p=SNP_FREQ,nbin=100 )
snpbin = pivot_wider(snppos, id_cols = c('orf','bin.pos'), names_from=bin.pos, names_prefix = "b", values_from = n_snp)Some sites are ambiguous between strains, meaning that some strains may have an alternative amino-acid at that position.
To get the SNPs, I considered the frequency of all variants must be above a cutoff of 5%.
Since, I am looking at amino-acid frequencies we can only consider non-synonymous mutations.
Indels are not incorporated in the strains sequences.
A total of 38779 variants have been identified spread over 29585 positions of the proteome.
n_snp= PROT_SNP %>%
group_by(orf) %>%
summarize( nsnp = n_distinct(pos_ref), len=mean(len)) %>%
left_join(DESC,by=c('orf'='ORF'))
# Histogram for number of SNP
F3 = ggplot(n_snp) +
geom_histogram(aes(nsnp),binwidth = 1, fill='white',color='black') +
theme_black()
ggplotly(F3)
# SNP count vs Protein Length
F4 = ggplot(n_snp) +
geom_point(aes(y=nsnp,x=len,text=orf,func=FUNCTION), color='white', fill=NA,,shape=21,size=0.8) +
xlab("Protein Length (AA)") + ylab('SNP count') +
theme_black()
## Warning: Ignoring unknown aesthetics: text, func
ggplotly(F4)
# SNP count along protein length
snp_by_pos = snppos %>%
group_by(bin.pos) %>%
summarize( avgSNP=mean(n_snp), sdSNP=sd(n_snp), maxSNP=max(n_snp))
F5 = ggplot(snp_by_pos,aes(x=bin.pos,y=avgSNP)) +
#geom_point() +
geom_line(col='white') +
geom_line(aes(y=maxSNP),col='red') +
geom_linerange(aes(ymin=avgSNP-sdSNP,ymax=avgSNP+sdSNP),col='white') +
geom_vline(xintercept=seq(5,95,by=5),size=0.1,linetype='dashed') +
ylab('SNP count') + xlab('Sequence position (%)') +
theme_black()
ggplotly(F5)
#p = grid.arrange(p0,p1,ncol=2)Similarly, I calculated the number of "early-STOP" codons, i.e. STOP codons which occur before the last amino acid.
SNP.PROT = left_join(n_snp, EXP , by=c('orf'='prot')) %>% # Add expression data
left_join(R4S, by=c('orf'='orf'))# %>% # Add conservation data
#left_join(DESC.UNI,by=c('orf'='ORF')) # Add annotation data
# SNP count vs Protein Length
F6 = ggplot(SNP.PROT) +
geom_point(aes(y=nsnp,x=r4s.fungi,text=orf,func=FUNCTION), color='white', fill=NA,,shape=21,size=0.8) +
xlab("Fungi Evolutionary rate") + ylab('SNP count') +
scale_y_continuous(trans='log2') +
theme_black()
## Warning: Ignoring unknown aesthetics: text, func
F7 = ggplot(SNP.PROT) +
geom_point(aes(y=nsnp,x=r4s.yk11,text=orf,func=FUNCTION), color='white', fill=NA,,shape=21,size=0.8) +
xlab("1011 strains evolutionary rate") + ylab('SNP count') +
scale_x_log10() + scale_y_continuous(trans='log2') +
theme_black()
## Warning: Ignoring unknown aesthetics: text, func
ggplotly(F6)ggplotly(F7)3.2 Variant viewer (Testing... in progress)
4 References
Jackson et al.: Genome evolution across 1,011 Saccharomyces cerevisiae isolates
Rate4site: Comparison of site-specific rate-inference methods: Bayesian methods are superior
Wapinski et al.: Natural history and evolutionary principles of gene duplication in fungi
Ho et al.: Unification of Protein Abundance Datasets Yields a Quantitative Saccharomyces cerevisiae Proteome
Dos Reis et al.: Solving the riddle of codon usage preferences: a test for translational selection
Session Info
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_IL.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_IL.UTF-8 LC_COLLATE=en_IL.UTF-8
## [5] LC_MONETARY=en_IL.UTF-8 LC_MESSAGES=en_IL.UTF-8
## [7] LC_PAPER=en_IL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] shiny_1.5.0 tAI_0.2 GO.db_3.10.0
## [4] org.Sc.sgd.db_3.10.0 openxlsx_4.1.5 hablar_0.3.0
## [7] Biostrings_2.54.0 forcats_0.5.0 stringr_1.4.0
## [10] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1
## [13] tidyr_1.0.3 tibble_3.0.4 tidyverse_1.3.0
## [16] tictoc_1.0 plotly_4.9.3 ggrepel_0.8.2
## [19] ggpubr_0.3.0 ggsci_2.9 ggthemes_4.2.0
## [22] gridExtra_2.3 ggplot2_3.3.2 AnnotationDbi_1.48.0
## [25] hutils_1.6.0 XVector_0.26.0 IRanges_2.20.2
## [28] S4Vectors_0.24.4 Biobase_2.46.0 BiocGenerics_0.32.0
## [31] xfun_0.20
##
## loaded via a namespace (and not attached):
## [1] fs_1.4.1 lubridate_1.7.9 bit64_0.9-7 httr_1.4.1
## [5] tools_3.6.2 backports_1.1.6 R6_2.4.1 DBI_1.1.0
## [9] lazyeval_0.2.2 colorspace_1.4-1 withr_2.2.0 tidyselect_1.1.0
## [13] bit_1.1-15.1 curl_4.3 compiler_3.6.2 cli_2.0.2
## [17] rvest_0.3.5 xml2_1.3.2 labeling_0.3 bookdown_0.21
## [21] scales_1.1.1 digest_0.6.25 foreign_0.8-72 rmarkdown_2.6
## [25] rio_0.5.16 pkgconfig_2.0.3 htmltools_0.5.0 fastmap_1.0.1
## [29] dbplyr_1.4.3 htmlwidgets_1.5.3 rlang_0.4.7 readxl_1.3.1
## [33] rstudioapi_0.11 RSQLite_2.2.0 generics_0.0.2 jsonlite_1.6.1
## [37] crosstalk_1.0.0 zip_2.0.4 car_3.0-6 magrittr_1.5
## [41] fansi_0.4.1 Rcpp_1.0.4.6 munsell_0.5.0 abind_1.4-5
## [45] lifecycle_0.2.0 stringi_1.4.6 yaml_2.2.1 carData_3.0-3
## [49] zlibbioc_1.32.0 grid_3.6.2 blob_1.2.1 promises_1.1.0
## [53] crayon_1.3.4 haven_2.2.0 hms_0.5.3 knitr_1.28
## [57] pillar_1.4.4 ggsignif_0.6.0 fastmatch_1.1-0 reprex_0.3.0
## [61] glue_1.4.2 evaluate_0.14 data.table_1.12.8 modelr_0.1.7
## [65] httpuv_1.5.2 vctrs_0.3.4 rmdformats_1.0.0 cellranger_1.1.0
## [69] gtable_0.3.0 assertthat_0.2.1 mime_0.9 xtable_1.8-4
## [73] broom_0.7.2 later_1.0.0 rstatix_0.5.0 viridisLite_0.3.0
## [77] memoise_1.1.0 ellipsis_0.3.0